We describe aspects of the Chroma software system for lattice QCD calculations. Chroma is an open source 
C++ based software system developed using the software infrastructure of the US SciDAC initiative. Chroma 
interfaces with output from the BAGEL assembly generator for optimised lattice fermion kernels on some archi- 
tectures. It can be run on workstations, clusters and the QCDOC supercomputer. 



1. INTRODUCTION 

We present the Chroma software system [1] 
for lattice QCD (LQCD) calculations. Chroma 
aims to provide a computational LQCD toolbox 
which is flexible, portable and efficient on a wide 
range of architectures from desktop workstations 
to clusters, commercial machines and new 
architectures such as the QCDOC [2]. 

Development on Chroma started at the JLab 1 
for the U.S. lattice community, in particular the 
LHPC collaboration [3], using software from the 
U.S. SciDAC initiative [4]. This effort has been 
joined by the UKQCD collaboration [5] who have 
been contributing to the effort on all levels. 

To achieve the goals of flexibility, portability 
and efficiency, Chroma relies on several layers of 
SciDAC and UKQCD software. 

1.1. SciDAC Software Hierarchy 

Toward the end of 2000, the U.S. Lattice 
community embarked on an ambitious project 
through the U.S. SciDAC initiative to standard- 
ise a set of software components in order to allow 
the effective exploitation of computing resources 
for LQCD. The following levels of software infras- 
tructure were defined: 

QCD Message Passing (QMP): provides 
a message passing API tailored to LQCD 
calculations. QMP was designed to take 
advantage of the specialised communication 
hardware of emerging architectures, such as 
the Serial Communications Unit (SCU) of 
the QCDOC, or the capabilities of Myrinet 
Network devices. 

Level 1 QCD Linear Algebra (QLA): 
defines operations to be performed at each 
site of the lattice. Primitives include SU(3) 
matrix-matrix and matrix-colour vector 
multiplications. 

Level 2 QCD Data Parallel (QDP): pro- 
vides lattice-wide operations such as basic 
linear algebra for lattice-wide fields. 

Level 3 Special optimised software: will 
provide portable interfaces to highly op- 
timised and machine-dependent pieces of 
code such as assembly-coded Dslash oper- 
ators. 
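
The per-site primitives named under Level 1 can be illustrated with a small sketch. This is not the QLA API: the type aliases and function names below (ColourVector, ColourMatrix, mat_vec, mat_mat) are hypothetical, and std::complex is used purely for readability rather than performance.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical per-site types: a colour vector and a 3x3 colour matrix.
using Complex      = std::complex<double>;
using ColourVector = std::array<Complex, 3>;
using ColourMatrix = std::array<std::array<Complex, 3>, 3>;  // row-major

// w = U * v at a single lattice site.
inline ColourVector mat_vec(const ColourMatrix& U, const ColourVector& v) {
    ColourVector w{};
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            w[i] += U[i][j] * v[j];
    return w;
}

// C = A * B at a single lattice site.
inline ColourMatrix mat_mat(const ColourMatrix& A, const ColourMatrix& B) {
    ColourMatrix C{};
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// Usage sketch: the identity matrix must leave a colour vector unchanged.
inline void identity_check() {
    ColourMatrix identity{};
    for (std::size_t i = 0; i < 3; ++i) identity[i][i] = 1.0;
    const ColourVector v{Complex(1.0, 2.0), Complex(0.0, -1.0), Complex(3.0, 0.0)};
    assert(mat_vec(identity, v) == v);
}
```

A Level 2 library would then apply such a primitive at every site of the lattice, which is where data-parallel layout and communication come in.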

A recent addition to this hierarchy is the QCD 
I/O (QIO) sublayer which provides a record ori- 
ented I/O API. XML metadata and binary data 
can be packaged as separate records of a single 
file for transmission to future Web/Grid services. 

Level 3 of the SciDAC hierarchy is not yet 
as mature as the other levels, hence Chroma 
uses optimised Dslash operators from a variety 
of sources, such as the Pentium 4 SSE assembler 
code developed at the JLab [6], or the output of 
the BAGEL [7] generator. 

2. QDP++ 

QDP++ is a C++ implementation of the QDP 
level of the SciDAC hierarchy, which Chroma 
uses as its foundation. QDP++ defines lattice-wide 
types and allows lattice-wide expressions. It 
has been integrated with the QIO framework and 
with modules to provide XML based I/O (e.g. 
for parameter reading, or processing QCDML 
markup [8]). 

2.1. Lattice Wide Types 

QDP++ models the tensor product structure 
of LQCD objects through a series of nested 
templates. For example, the indices of lattice 
fermion fields can follow the structure: 

sites ⊗ spins ⊗ colours ⊗ complex numbers 

QDP++ would model this through the following 
C++ templated type: 

OLattice< PSpinVector< PColorVector< 
RComplex< REAL >, Nc >, Ns > > 

where REAL is the type of the complex components, 
defined as either float or double. Nc and 
Ns are the numbers of colour and spin components 
respectively, defined during configuration. 
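
To make the nesting concrete, here is a minimal sketch with toy stand-ins for the four template levels. The class names mirror the text, but the bodies are assumptions made for illustration: storage is a plain std::array, and the site count is a hypothetical compile-time parameter (kSites), which the type shown above does not have.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-ins for the type tower; storage and interfaces are assumptions.
template <typename T>
struct RComplex { T re{}, im{}; };

template <typename T, std::size_t N>
struct PColorVector { std::array<T, N> c{}; };      // colour index

template <typename T, std::size_t N>
struct PSpinVector { std::array<T, N> s{}; };       // spin index

template <typename T, std::size_t Sites>
struct OLattice { std::array<T, Sites> site{}; };   // site index (toy: fixed size)

using REAL = float;                     // or double
constexpr std::size_t Nc = 3;           // colour components
constexpr std::size_t Ns = 4;           // spin components
constexpr std::size_t kSites = 16;      // hypothetical toy volume

// sites (x) spins (x) colours (x) complex numbers
using ToyLatticeFermion =
    OLattice<PSpinVector<PColorVector<RComplex<REAL>, Nc>, Ns>, kSites>;

// Reach one complex component by walking the tower from the outside in.
inline RComplex<REAL>& component(ToyLatticeFermion& f, std::size_t x,
                                 std::size_t spin, std::size_t colour) {
    assert(x < kSites && spin < Ns && colour < Nc);
    return f.site[x].s[spin].c[colour];
}

// Real numbers stored per site: spins * colours * (re, im).
constexpr std::size_t reals_per_site() { return Ns * Nc * 2; }
```

For example, `component(psi, 5, 2, 1).re = 1.5f;` sets the real part of spin 2, colour 1 at site 5 of a `ToyLatticeFermion psi{}`.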

QDP++ operates on such types by recursing 
down the tower of templates. To multiply the 
above fermion type by a scalar, one would first 
loop through the indices of the OLattice type 
and for each one, call the multiply operation for 
the PSpinVector type and so on. 
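
The recursion just described can be sketched as a family of overloads, one per template level, each looping over its own index and delegating to the level below. The free function scale() and the member name elem are hypothetical; QDP++ itself expresses this through overloaded operators, and the fixed site count is a simplification for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Minimal toy tower (illustrative only).
template <typename T> struct RComplex { T re{}, im{}; };
template <typename T, std::size_t N> struct PColorVector { std::array<T, N> elem{}; };
template <typename T, std::size_t N> struct PSpinVector  { std::array<T, N> elem{}; };
template <typename T, std::size_t N> struct OLattice     { std::array<T, N> elem{}; };

// Innermost level: the recursion ends at a single complex number.
template <typename T, typename S>
void scale(RComplex<T>& z, S a) { z.re *= a; z.im *= a; }

// Colour level: loop over colour, delegate to the complex level.
template <typename T, std::size_t N, typename S>
void scale(PColorVector<T, N>& v, S a) { for (auto& e : v.elem) scale(e, a); }

// Spin level: loop over spin, delegate to the colour level.
template <typename T, std::size_t N, typename S>
void scale(PSpinVector<T, N>& v, S a) { for (auto& e : v.elem) scale(e, a); }

// Outermost level: loop over sites, delegate to the spin level.
template <typename T, std::size_t N, typename S>
void scale(OLattice<T, N>& v, S a) { for (auto& e : v.elem) scale(e, a); }

// Usage sketch on a toy 8-site field (the sizes are arbitrary choices).
inline float scale_demo() {
    OLattice<PSpinVector<PColorVector<RComplex<float>, 3>, 4>, 8> psi{};
    psi.elem[2].elem[1].elem[0].re = 1.0f;
    scale(psi, 2.5f);
    assert(psi.elem[0].elem[0].elem[0].re == 0.0f);  // zero entries stay zero
    return psi.elem[2].elem[1].elem[0].re;
}
```

Because each level only knows about the level directly beneath it, permuting the nesting order would leave these overloads unchanged.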

The order of the templates is not fixed in prin- 
ciple and could be permuted in order to take ad- 
vantage of different parallel architectures. The 
templated types are aliased to fixed type names 
such as LatticeFermion that are used by higher- 
level codes. 

2.2. Template Expressions, PETE 

QDP++ provides lattice-wide arithmetic ex- 
pressions by overloading operators. The AXPY 
operation 

z ← ax + y (1) 

for lattice fermion fields x,y,z and scalar a is 
coded as: 

z = a*x + y. (2) 

To eliminate temporaries in expressions, 
QDP++ uses the Portable Expression Template 
Engine (PETE) [9] (see footnote 2). The C++ binary 
operators are redefined to transform the expression 
into an expression template type. On assignment, 
storage for the result and the expression template, 
which includes references to the operands, 
are all given as arguments to an evaluate() function. 
This is done at compile time through 
the template instantiation mechanism. The 
process for the AXPY operation of Eq. (1) is 
shown in Fig. 1. 

Step 1: operator*() : ( Real(alpha), Vector(x) ) 
            |   Returns Dressed Templated Type 
            V 
Step 2: operator+() : ( QDPExpr<...>, Vector(y) ) 
            |   Returns Dressed Templated Type 
            V 
        QDPExpr< BinaryNode< OpAdd, 
            BinaryNode< OpMultiply, Real(alpha), Vector(x) >, 
            Vector(y) >, Vector(result) > 
            | 
            V 
Step 3: operator=() : ( QDPExpr<...> ) { 
            evaluate( QDPExpr<...>, *this );  // *this is Vector(result) 
            return *this;                     // Returns reference only 
        } 

Figure 1. Expression Transformation 

Templated evaluate() functions can be spe- 
cialised for specific expression templates, and the 
specialised functions can call optimised subroutines, 
e.g. fast AXPY-like functions. 
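
The three steps of Fig. 1 and the specialisation of evaluate() can be sketched in a few dozen lines. This is a toy written for illustration, not the PETE or QDP++ code: Vec, VecRef, Scalar, Expr, AxpyNode and the counter axpy_kernel_calls are all hypothetical names, the element type is fixed to double, and the "optimised" path is an ordinary loop standing in for a tuned kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Operation tags carrying the elementwise operation.
struct OpAdd      { static double apply(double a, double b) { return a + b; } };
struct OpMultiply { static double apply(double a, double b) { return a * b; } };

struct Scalar {                       // leaf: same value at every index
    double value;
    double operator[](std::size_t) const { return value; }
};

struct VecRef {                       // leaf: refers to operand storage, no copy
    const std::vector<double>* data;
    double operator[](std::size_t i) const {
        assert(i < data->size());
        return (*data)[i];
    }
};

template <typename Op, typename L, typename R>
struct BinaryNode {                   // interior node, evaluated lazily per element
    L lhs;
    R rhs;
    double operator[](std::size_t i) const { return Op::apply(lhs[i], rhs[i]); }
};

template <typename E>
struct Expr { E node; };              // the "dressed" expression type

// The exact type produced by a*x + y.
using AxpyNode =
    BinaryNode<OpAdd, BinaryNode<OpMultiply, Scalar, VecRef>, VecRef>;

struct Vec;
template <typename E> void evaluate(Vec& dest, const Expr<E>& e);
inline void evaluate(Vec& dest, const Expr<AxpyNode>& e);

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    // Step 3: assignment passes the destination and the expression to evaluate().
    template <typename E>
    Vec& operator=(const Expr<E>& e) { evaluate(*this, e); return *this; }
};

// Step 1: scalar * vector builds a multiply node; nothing is computed yet.
inline Expr<BinaryNode<OpMultiply, Scalar, VecRef>>
operator*(double a, const Vec& x) {
    return {{Scalar{a}, VecRef{&x.data}}};
}

// Step 2: expression + vector wraps the previous node in an add node.
template <typename E>
Expr<BinaryNode<OpAdd, E, VecRef>> operator+(const Expr<E>& e, const Vec& y) {
    return {{e.node, VecRef{&y.data}}};
}

int axpy_kernel_calls = 0;            // counts uses of the specialised path

// Generic evaluation: a single loop over the result, no temporaries.
template <typename E>
void evaluate(Vec& dest, const Expr<E>& e) {
    for (std::size_t i = 0; i < dest.data.size(); ++i) dest.data[i] = e.node[i];
}

// Overload chosen for exactly a*x + y; a real code could call a tuned routine.
inline void evaluate(Vec& dest, const Expr<AxpyNode>& e) {
    ++axpy_kernel_calls;
    const double a = e.node.lhs.lhs.value;
    const std::vector<double>& x = *e.node.lhs.rhs.data;
    const std::vector<double>& y = *e.node.rhs.data;
    assert(x.size() == dest.data.size() && y.size() == dest.data.size());
    for (std::size_t i = 0; i < dest.data.size(); ++i) dest.data[i] = a * x[i] + y[i];
}
```

With `Vec x(4, 1.0), y(4, 2.0), z(4);` the statement `z = 3.0 * x + y;` is routed to the AXPY overload, while `z = 3.0 * x;` falls back to the generic loop. The choice is made by overload resolution on the expression type, so it costs nothing at run time.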

It may be possible to optimise general expres- 
sions even further by applying PETE techniques 
at all levels of our template hierarchy rather than 
just at the OLattice level as is done now. 

3. CHROMA 

Chroma was developed to serve the needs of the 
LHPC and UKQCD collaborations, which have 
included spectroscopy, decay constant, nucleon 
form factor and structure function moment 
calculations. The authors are also interested in 
investigating chiral fermion actions. Hence the 
code contains Wilson, Domain Wall and Overlap 
fermion operators and numerous inverters. 
Code also exists to compute hadronic 2-point 
and 3-point correlation functions. Furthermore, 
UKQCD researchers have written an ASqTAD 
fermion operator in the staggered library. 

Footnote 2: PETE is a stand-alone component of the POOMA library. 

3.1. Chroma and Efficiency 

By optimising QDP++ as discussed previously 
and using high performance kernels in our linear 
operators we have achieved high performance on 
workstations, clusters and the QCDOC. 

Fig. 2 shows single-precision performance measurements 
with an even-odd preconditioned Wilson 
Dirac operator on 4 nodes of a QCDOC using 
the assembler Dslash from BAGEL. We show the 
performance of the Dslash, the Dirac operator, a 
naively implemented Conjugate Gradients (CG) 
solver and an optimised CG solver, which makes 
direct calls to the Dslash routine and fuses the 
application of the Dirac Matrix with the compu- 
tation of the norm of the result where possible. 
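
The fusion mentioned above can be illustrated on a toy operator. The sketch below is not Chroma code: the operator is a hypothetical 1D periodic stencil standing in for the Dirac matrix, and the function names are invented. It only shows why fusing helps: the unfused version sweeps over the result twice, the fused version once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in operator: (Mx)[i] = 2 x[i] - x[i-1] - x[i+1], periodic.
inline double apply_row(const std::vector<double>& x, std::size_t i) {
    const std::size_t n = x.size();
    return 2.0 * x[i] - x[(i + n - 1) % n] - x[(i + 1) % n];
}

// Unfused: apply the operator, then read the result again to form |y|^2.
inline double apply_then_norm2(const std::vector<double>& x,
                               std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = apply_row(x, i);
    double norm2 = 0.0;
    for (double v : y) norm2 += v * v;
    return norm2;
}

// Fused: accumulate |y|^2 while each element of y is still in a register.
inline double apply_with_norm2(const std::vector<double>& x,
                               std::vector<double>& y) {
    assert(x.size() == y.size());
    double norm2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double v = apply_row(x, i);
        y[i] = v;
        norm2 += v * v;
    }
    return norm2;
}
```

Both functions return the same value and fill y identically; the fused one saves a pass over memory, which matters when the solver is bandwidth-bound.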

The Dslash routine (from BAGEL) was de- 
signed to amortise communications latency at a 
local lattice size of 256 sites. Scaling from this 
point to larger volumes can be seen in Fig. 2. The 
cost of the additional vector operations needed to 
combine the Dslash applications into the Dirac 
operator appears as the difference between the 
red and black bars. The difference between the 
general CG and the optimised one is about 2-3% 
of peak over the whole range of volumes, show- 
ing that the overhead from QDP++ expressions 
is very small. 

4. CONCLUSIONS 

We have discussed the SciDAC software hier- 
archy and elements of the QDP++ and Chroma 
software suites for lattice QCD and have shown 
how the latter may be optimised in order to 
achieve high efficiency through template expres- 
sions and by interfacing cleanly with third party 
high performance libraries. 

Due to lack of space we have had to gloss 
over several other nice features of Chroma and 
QDP++ such as the many applications already 
existing in the code base and our build system 
which uses the GNU Autotools, allowing for easy 
configuration and building. These details are de- 
ferred to a forthcoming publication. 

Our future plans include the implementation 
of gauge configuration generating algorithms for 
quenched and dynamical fermions. At the time 
of writing, the basic class framework for this is 
already in place. 

QMP, QDP++ and Chroma are open source 
software and are freely available through anony- 
mous CVS, details of which can be found in . 

Figure 2. Performance on the QCDOC 